Abstract. We compare recent approaches to community structure identification
in terms of sensitivity and computational cost. The recently proposed modularity
measure is revisited, and the performance of the methods as applied to ad hoc networks
with known community structure is compared. We find that the most accurate
methods tend to be more computationally expensive, and that both aspects need to
be considered when choosing a method for practical purposes. The work is intended
as an introduction as well as a proposal for a standard benchmark test of community
detection methods.



1. Introduction 

The study of complex networks has received an enormous amount of attention from the
scientific community in recent years [1, 2, 3, 4, 5, 6]. Physicists in particular have become
interested in the study of networks describing the topologies of a wide variety of systems,
such as the world wide web, social and communication networks, biochemical networks
and many more. An important open problem is the analysis of modular structure
found in many networks [7]. Distinct modules or communities within networks can
loosely be defined as subsets of nodes which are more densely linked to each other than
to the rest of the network. Such communities have been observed in different kinds of
networks, most notably in social networks, but also in networks of other origin such as
metabolic or economic networks [8, 9, 10, 11]. As a result, the problem of identification
of communities has been the focus of many recent efforts.

Community detection in large networks is potentially very useful. Nodes belonging
to a tight-knit community are more than likely to have other properties in common. For
instance, in the world wide web, community analysis has uncovered thematic clusters
[12, 13]. In biochemical or neural networks, communities may be functional groups
[14], and separating the network into such groups could simplify functional analysis
considerably.

The problem of community detection is quite challenging and has been the subject
of discussion in various disciplines. A simpler version of this problem, the graph
bi-partitioning problem (GBP), has been the topic of study in the realm of computer
science for decades. Here, one looks to separate the graph into two densely connected
communities of equal size, which are connected with the minimum number of links.
This is an NP-complete problem‡ [16]; however, several methods have been proposed to
reduce the complexity of the task [17, 18, 19, 20]. In real complex networks we often
have no idea how many communities we wish to discover, but in general it is more than
two. This makes the process all the more costly. What is more, communities may also
be hierarchical, that is, communities may be further divided into sub-communities and
so on [21, 22, 23, 24].

Nevertheless, many attempts to tackle these problems have been proposed recently. 
The proposed methods vary considerably in terms of approach and application, which 
makes them difficult to compare. Community identification is potentially very useful 
and researchers from a number of fields may be interested in using one or several of the 
methods for their own purposes. But which? In order for the reader to be able to make 
an informed decision as to which method is most appropriate for which purpose, we 
distil information from the literature and compare the performance of those methods 
which lend themselves to objective comparison. 

To this end, this paper is organised as follows. In section 2 we revisit the modularity 
measure designed to evaluate how good a particular partition of a network is. Then, 
we describe how to measure the sensitivity of the various methods and suggest the use 
of a more accurate representation of algorithm sensitivity based on information theory. 
We then compare the methods from a computational cost perspective and compare 
their sensitivity when applied to ad hoc networks with community structure. Finally, 
we suggest appropriate choices of community identification methods for a few different 
problems. 

2. Evaluating community identification 

A question that has been raised in recent years is how a given partition of a network into
communities can be evaluated. A simple approach that has become widely accepted was
proposed in [25]. It is based on the intuitive idea that random networks do not exhibit
community structure. Let us imagine that we have an arbitrary network and an arbitrary
partition of that network into $n_c$ communities. It is then possible to define an
$n_c \times n_c$ matrix $e$ whose elements $e_{ij}$ represent the fraction of total links
starting at a node in partition $i$ and ending at a node in partition $j$. Then, the sum of
any row (or column) of $e$, $a_i = \sum_j e_{ij}$, corresponds to the fraction of links
connected to $i$.

‡ In computational complexity theory, NP ('Non-deterministic Polynomial time') is the set of decision
problems solvable in polynomial time on a non-deterministic Turing machine. NP-complete problems
are the most difficult problems in NP.

If the network does not exhibit community structure, or if the partitions are
allocated without any regard to the underlying structure, the expected value of the
fraction of links within partitions can be estimated. It is simply the probability that a
link begins at a node in $i$, $a_i$, multiplied by the fraction of links that end at a node in
$i$, $a_i$. So the expected fraction of intra-community links is just $a_i a_i = a_i^2$. On the
other hand, we know that the real fraction of links exclusively within a partition is $e_{ii}$.
So, we can compare the two directly and sum over all the partitions in the graph:

$$Q = \sum_i \left( e_{ii} - a_i^2 \right) \qquad (1)$$

This is a measure known as modularity. As an example, let us consider a network
comprised of $n_c$ identical fully connected components with no links between them. If we
then have $n_c$ partitions, corresponding exactly to the components, modularity will have a
value of $1 - 1/n_c$. As $n_c$ gets large, this value tends to 1. On the other hand, for
particularly "bad" partitions, for example when all the nodes are in a community of
their own, modularity can take negative values. This is due to the fact that
when nodes are alone in partitions there can be no internal links. To avoid this issue,
Massen & Doye propose an alternative measure [26].
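
As an illustration, the modularity of a given partition can be computed directly from a
list of links. The following is a minimal Python sketch, assuming an undirected,
unweighted network in which each link is listed once. The names `modularity` and
`disjoint_cliques` are hypothetical and introduced here only for the example; for
undirected links, $e_{ii}$ is taken as the fraction of links inside community $i$ and
$a_i$ as the fraction of link ends attached to it.

```python
from collections import defaultdict
from itertools import combinations


def modularity(edges, community):
    """Return Q = sum_i (e_ii - a_i^2) for an undirected network.

    edges: list of (u, v) pairs, each undirected link listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a network with no links")
    internal = defaultdict(int)  # number of links inside each community
    ends = defaultdict(int)      # number of link ends attached to each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # e_ii = internal[c] / m and a_i = ends[c] / (2 m)
    return sum(internal[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


def disjoint_cliques(n_c, size):
    """Hypothetical helper: n_c fully connected components of `size` nodes each."""
    edges, community = [], {}
    for c in range(n_c):
        nodes = range(c * size, (c + 1) * size)
        edges.extend(combinations(nodes, 2))
        community.update({v: c for v in nodes})
    return edges, community


edges, community = disjoint_cliques(n_c=4, size=5)
print(modularity(edges, community))                  # equals 1 - 1/4 for this example
print(modularity(edges, {v: v for v in community}))  # negative: every node is alone
```

With four identical components the first value reproduces $1 - 1/n_c$, and placing
every node in its own community gives a negative value, as described above.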

It is tempting to think that random networks exhibit very small values of
modularity. As Guimerà et al. show, this is not the case [27]. For random networks of
finite size it is possible to find a partition whose modularity is not only nonzero but
quite high; for example, a random network of 128 nodes and 1024 links has a maximum
modularity of 0.208. This suggests that random networks, which by construction have no
modular structure, can nevertheless appear to have one because of fluctuations.

3. Comparative evaluation 

The methods that have been presented recently are extremely varied, and are based on
a range of different ideas. In a longer article, we describe the methods in more detail
and classify them according to the type of approach they present [28]. Also, the full
description of each can be found in the respective references. Here we concentrate on
comparing the methods in terms of performance. In order for the reader to be able to
compare the algorithms, both in terms of their speed and sensitivity, we would like to
present a qualitative comparison for all the methods presented until now. However, this
is not possible as they are very varied, both conceptually and in their applications.

One way that has been employed to test sensitivity in many cases is to see how well
a particular method performs when applied to ad hoc networks with a well-known, fixed
community structure [25]. Such networks are typically generated with $n = 128$ nodes,
split into four communities containing 32 nodes each. Pairs of nodes belonging to the
same community are linked with probability $p_{in}$, whereas pairs belonging to different
communities are joined with probability $p_{out}$. The value of $p_{out}$ is taken so that the
average number of links a node has to members of any other community, $z_{out}$, can be
controlled. While $p_{out}$ (and therefore $z_{out}$) is varied freely, the value of $p_{in}$ is
chosen to keep the total average node degree, $k$, constant and set to 16. As $z_{out}$ is
increased from zero, the communities become more and more diffuse and harder to
identify (Figure 1). Since the "real" community structure is well known in this case, it is
possible to measure the number of nodes correctly classified by the method of community
identification.
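
The construction of such ad hoc networks can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function name `ad_hoc_network` is
hypothetical, $z_{out}$ is taken as the average number of links to all other communities
combined, and the probabilities are derived from the expected degrees as
$p_{in} = (k - z_{out})/31$ and $p_{out} = z_{out}/96$ for four communities of 32 nodes.

```python
import random


def ad_hoc_network(z_out, k=16, n_comm=4, size=32, seed=None):
    """Generate a random network with planted communities.

    Returns (edges, community), where edges is a list of (u, v) pairs and
    community maps each node to the index of its planted community.
    """
    rng = random.Random(seed)
    n = n_comm * size
    # Assumption: probabilities chosen so that the expected degrees match.
    p_in = (k - z_out) / (size - 1)   # size - 1 possible partners inside
    p_out = z_out / (n - size)        # n - size possible partners outside
    community = {v: v // size for v in range(n)}
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


edges, community = ad_hoc_network(z_out=6, seed=1)
print(len(community), "nodes, average degree", 2 * len(edges) / len(community))
```

Because links are drawn independently, the average degree of any single realisation
fluctuates around 16 rather than matching it exactly.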

In [21], the author describes a method to calculate this value. The largest group
found within each of the four "real" communities is considered correctly classified. If
more than one original community is clustered together by the algorithm, all nodes
in that cluster are considered incorrectly classified. For example, for the case when
$z_{out}/k$ is small, if a method finds three communities, two of which correspond exactly
to two original communities, and a third, which corresponds to the other two clustered
together, this measure would consider half the nodes correctly classified. As the author
notes, this measure is quite harsh, and some nodes which one may consider to be
correctly clustered are not counted. On the other end of the spectrum, as $z_{out}/k$ becomes
large, and the networks become essentially random networks, this method rewards the
identification of smaller clusters found within each of the original communities, which
could be misleading.
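
The counting rule can be sketched as follows. This is one possible reading of the
description above, not necessarily the exact procedure used in [21]: the function name
`fraction_correct` is hypothetical, and a found cluster is treated as "merging" original
communities when it is the largest group of more than one of them.

```python
from collections import Counter, defaultdict


def fraction_correct(real, found):
    """Fraction of nodes counted as correctly classified.

    real, found: dicts mapping each node to its real / found community label.
    """
    overlap = defaultdict(Counter)  # real label -> sizes of found clusters inside it
    for node, r in real.items():
        overlap[r][found[node]] += 1
    # Largest found group within each real community: (found label, size).
    best = {r: counts.most_common(1)[0] for r, counts in overlap.items()}
    # How many real communities have the same found cluster as their largest group.
    claims = Counter(label for label, _ in best.values())
    correct = sum(size for label, size in best.values() if claims[label] == 1)
    return correct / len(real)


real = {v: v // 32 for v in range(128)}
merged = {v: min(v // 32, 2) for v in range(128)}  # last two communities merged
print(fraction_correct(real, real))    # 1.0
print(fraction_correct(real, merged))  # 0.5
```

The second call reproduces the example in the text: two communities are found exactly,
the other two are clustered together, and half the nodes are counted as correct.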

We suggest that a more discriminatory measure is more appropriate, and propose
the use of the normalised mutual information measure, as described in [29, 30]. It
is based on defining a confusion matrix $N$, where the rows correspond to the "real"
communities and the columns correspond to the "found" communities. The element
$N_{ij}$ of $N$ is the number of nodes in the real community $i$ that appear in the found
community $j$. A measure of similarity between the partitions, based on information
theory, is then:

$$I(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log\left( \frac{N_{ij} N}{N_{i.} N_{.j}} \right)}{\sum_{i=1}^{c_A} N_{i.} \log\left( \frac{N_{i.}}{N} \right) + \sum_{j=1}^{c_B} N_{.j} \log\left( \frac{N_{.j}}{N} \right)} \qquad (2)$$

where the number of real communities is denoted $c_A$ and the number of found
communities is denoted $c_B$, the sum over row $i$ of matrix $N_{ij}$ is denoted $N_{i.}$, the sum
over column $j$ is denoted $N_{.j}$, and $N$ inside the logarithms is the total number of nodes.
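
A minimal Python sketch of this measure is given below. It assumes the standard form of
normalised mutual information built from the confusion matrix, that is, minus twice the
sum of $N_{ij} \log(N_{ij} N / (N_{i.} N_{.j}))$ divided by the sum of
$N_{i.} \log(N_{i.}/N)$ and $N_{.j} \log(N_{.j}/N)$, with $N$ the total number of nodes.
The function name is hypothetical, and returning 1 when both partitions consist of a
single community is a convention chosen here, not something stated in the text.

```python
from collections import Counter
from math import log


def normalised_mutual_information(real, found):
    """Normalised mutual information between two partitions of the same nodes.

    real, found: dicts mapping each node to its real / found community label.
    """
    n = len(real)
    n_ij = Counter((real[v], found[v]) for v in real)  # confusion matrix entries
    n_i = Counter(real.values())                       # row sums
    n_j = Counter(found.values())                      # column sums
    numerator = -2.0 * sum(
        c * log(c * n / (n_i[i] * n_j[j])) for (i, j), c in n_ij.items()
    )
    denominator = sum(c * log(c / n) for c in n_i.values()) + sum(
        c * log(c / n) for c in n_j.values()
    )
    if denominator == 0:
        # Assumption: both partitions are one single community, hence identical.
        return 1.0
    return numerator / denominator


real = {v: v // 32 for v in range(128)}
merged = {v: min(v // 32, 2) for v in range(128)}  # last two communities merged
print(normalised_mutual_information(real, real))
print(normalised_mutual_information(real, merged))
```

For the merged partition the value works out to 6/7, about 0.857, compared with the 0.5
that the fraction of correctly classified nodes assigns to the same outcome.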

If the found partitions are identical to the real communities, then $I(A, B)$ takes its
maximum value of 1. If the partition found by the algorithm is totally independent of
the real partition, for example when the entire network is found to be one community,
$I(A, B) = 0$.

Both measures of accuracy give a good idea of how a method performs. However, the
measure we propose for use here is more representative of sensitivity if the performance
is dubious, since it explicitly measures the amount of information correctly extracted by
the algorithm. As an example, for small $z_{out}$, where two original communities
are clustered together by the algorithm, this measure does not punish the algorithm as
severely, taking into account the ability to extract at least some information about the
community structure. On the other hand, for large $z_{out}$, this measure is able to detect
that the clusters found by the algorithm have little to do with the original communities,
and $I(A, B) \to 0$.

Table 1. Summary of how the computational cost of different approaches scales
with the number of nodes $n$, number of links $m$ and average degree $\langle k \rangle$ [42]. The labels
shown here are used in Figures 2 and 3.

In Figure 2 we show the sensitivity of all methods we have been able to gather.
The percentage of correctly identified nodes is calculated using the method described
in [21], since this is the method employed by the various authors. We can see that
accuracy varies in a similar way across the different methods as $z_{out}$ increases and the
communities become more diffuse. So, it remains difficult to compare the performance
by looking at the methods separately, even with a reference performance.

To summarise the large amount of information, in Figure 3 we plot the fraction
of correctly identified nodes for only three values of $z_{out}$ (6, 7 and 8), corresponding
to $z_{out}/k = 0.375$, 0.4375 and 0.5 respectively, for each method. From this we can see
that most of the methods perform very well for $z_{out} = 6$ ($z_{out}/k = 0.375$), and even for
$z_{out} = 7$ ($z_{out}/k = 0.4375$) most can identify more than half the nodes correctly. For
$z_{out} = 8$ ($z_{out}/k = 0.5$) two methods are still able to identify more than 80% of the
nodes correctly§.
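
As a short worked example of why communities can still be detectable at
$z_{out}/k = 0.5$, the lines below compare the average number of links a node has to its
own community with the average number it has to each single other community. This
assumes, as an idealisation, that the external links are spread evenly over the three
other communities.

```latex
z_{in} = k - z_{out} = 16 - 8 = 8,
\qquad
\frac{z_{out}}{3} = \frac{8}{3} \approx 2.67 .
```

On average a node therefore keeps about three times as many links to its own community
as to any one of the others.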

While accuracy is an essential consideration when choosing a method, it is just as
important to consider the computational effort needed to perform the analysis [42]. For
some of the approaches described in the literature, we have collected estimates of how
the cost scales with network observables. For networks with $n$ nodes and $m$ links, the
methods scale between $O(m + n)$ for the fastest and $O(\exp(n))$ for the slowest (Table
1). Such diversity is due to the different approaches taken by the authors. The faster
methods tend to be approximate and less accurate, while the slower methods have other
advantages (see [28] for a more detailed discussion). Differences in speed only become
important when dealing with larger networks.

§ One might expect that as the proportion of out links approaches 0.5, community structure no longer
exists. However, since the external links are distributed among the other three communities, individual
nodes remain more strongly connected to their own community than to other communities, even at
this high value of $z_{out}/k$.

Figure 1. Algorithm sensitivity as applied to ad hoc networks with $n = 128$, the
network divided into four communities of 32 nodes each and total average degree $k$
fixed to 16. For low $z_{out}/k$ the communities are easily distinguished. For higher $z_{out}/k$
this becomes more complicated. Both measures of comparing original communities to
ones found by the detection method are shown. The normalised mutual information
measure is more discriminatory and appears more sensitive to errors in the community
identification procedure. The results are shown for Newman's fast algorithm [21] and
the extremal optimisation algorithm [31].

4. Choosing an algorithm 

One has to take many factors into account when choosing an algorithm to use. The 
above comparison ought to give the reader an idea as to which algorithm is most 
appropriate for a given problem. In many cases, a compromise must be reached between 
accuracy and running time, especially for larger networks. To clarify this further, here 
are a few examples of real networks, and our suggestion for the appropriate community 
identification algorithm. 

Say we want to analyse a relatively small network, for example the metabolic
network of the worm Caenorhabditis elegans, which has 453 nodes. Since the network
is small, and current desktop computer technology is reasonably fast, the speed of
the algorithm should pose no restriction, and one is free to choose the slower, more
accurate methods. In this case the Simulated Annealing (SA) method would be the
most appropriate choice, since it gives the most accurate partitions, especially if the
system is allowed to cool slowly.

Figure 2. Comparing algorithm sensitivity using ad hoc networks with predetermined
community structure. The x-axis is the proportion of connections to outside
communities $z_{out}/k$ and the y-axis is the fraction of nodes correctly identified by the
method, measured as described in [21]. The labels here correspond to the different
methods and are listed in Table 1.

Larger networks, with the number of nodes of the order of $10^5$, become intractable
with the more accurate methods. For example, when attempting to study the
community structure of the actor collaboration network with 374511 nodes, we estimate
that the SA algorithm would take a few months of uninterrupted computation. However,
a reasonable implementation of the fast algorithm would be able to perform this analysis
in just a few hours, making it the appropriate choice, even if its accuracy is not
the best.

Let us consider an intermediate sized network such as the Pretty Good Privacy
(PGP) web of trust social network [45], containing 10680 nodes. Although the SA
algorithm would run in a reasonable time, it may be a better choice to compromise and
employ a faster running algorithm. The EO method is not quite as accurate as SA, but
the saving in computational effort for a network of this size is considerable. It is,
however, more accurate than the fast algorithm, which makes it the better choice.

Figure 3. The fraction of correctly identified nodes at three specific values of $z_{out}$,
6, 7 and 8, for all available methods and for networks with fixed $k = 16$. Note that
for the FLM method, the data for $z_{out} = 8$ were not available. Here we can see that
most of the methods are very good at finding the "correct" community structure for
values of $z_{out}$ up to 6. At $z_{out} = 7$ some methods begin to falter but most still identify
more than half of the nodes correctly. At $z_{out} = 8$, when on average half the links are
external, two methods are still able to identify over 80% of the nodes correctly.

5. Conclusion 

In this work we have given a brief overview and comparison of the modern approaches
to community identification in complex networks. A large amount of knowledge has
been collected in the field, and real progress has been made, both in the identification
of communities and their characterisation. Some questions do remain open, and it is
these that we would suggest for further study. Despite these efforts, the cost involved in
computing communities in complex networks remains significant. The fastest algorithm
runs in linear time, but this particular method needs a priori knowledge of the number
of expected communities, and assumes that all communities are of similar size. At
present, the fastest method for finding an unknown number of communities of unknown
sizes has a cost which scales as $O(n \log^2 n)$ with network size. While this makes
the analysis of extremely large networks feasible, this algorithm does not guarantee
that the partition found is the best possible one. Other algorithms which are more
computationally expensive have other merits, such as accuracy or the ability to identify
overlapping communities. So, when choosing a method one must consider carefully the
context of its use. Ideally, one would like to have a method which guarantees accuracy
and is fast at the same time, but finding such a method is challenging. The search for
faster and more accurate methods is an important one and we would suggest this for
further study.
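
To give a rough feel for how an $n \log^2 n$ cost grows across the three example networks
discussed in section 4, the sketch below compares their relative costs. It is only an
illustration: the function name `relative_cost` is hypothetical, constant factors and
lower-order terms are ignored, and the result says nothing about actual running times.

```python
from math import log


def relative_cost(n, reference_n):
    """Ratio of n * log(n)^2 at size n to the same quantity at reference_n."""
    def cost(x):
        return x * log(x) ** 2
    return cost(n) / cost(reference_n)


# Network sizes quoted in the text.
sizes = {
    "C. elegans metabolic network": 453,
    "PGP web of trust": 10680,
    "actor collaboration network": 374511,
}
for name, n in sizes.items():
    print(f"{name}: {relative_cost(n, 453):.0f} times the cost at n = 453")
```

The ratio grows somewhat faster than the number of nodes itself, which is why such a
method remains usable on very large networks while costlier methods do not.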